An HMM trajectory tiling (HTT) approach to high quality TTS
Abstract
We propose an HMM Trajectory Tiling (HTT) approach to high-quality TTS, which is our entry to Blizzard Challenge 2010. In HTT, a refined HMM is first trained with the Minimum Generation Error (MGE) criterion; the trajectory generated by the refined HMM then guides the search for the closest waveform segment “tiles” at synthesis time. Normalized distances between the HMM trajectory and the trajectories of the waveform unit candidates are used to select the final candidates in a unit sausage (lattice). Normalized cross-correlation, a good concatenation measure because of its high relevance to spectral similarity, phase continuity, and concatenation time instants, is used to find the best unit sequence in the sausage. That sequence provides the segment tiles that most closely follow the HMM trajectory guide. Tested on four tasks of Blizzard Challenge 2010 (EH1, EH2, MH1, and MH2), the new HTT approach delivers high-quality, natural-sounding TTS speech without sacrificing intelligibility. These results are confirmed subjectively by the naturalness and intelligibility scores of the listening tests.
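The abstract names normalized cross-correlation as the concatenation measure but gives no formula or implementation. The following is a minimal sketch of how such a measure could score a single join between two waveform segments. It is not the authors' implementation: the function names `ncc` and `best_join`, and the `window` and `max_lag` parameters, are hypothetical and chosen only for illustration.

```python
import math


def ncc(x, y):
    """Normalized cross-correlation of two equal-length sequences, in [-1, 1]."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    # Treat an all-zero (silent) segment as uncorrelated rather than dividing by zero.
    return num / den if den > 0.0 else 0.0


def best_join(left, right, window, max_lag):
    """Hypothetical helper: find the offset into `right` that best matches the end of `left`.

    Compares the last `window` samples of `left` with `window` samples of
    `right` starting at each lag in 0..max_lag, and returns (lag, score)
    for the lag with the highest normalized cross-correlation.
    """
    tail = left[-window:]
    best_lag, best_score = 0, -2.0  # -2.0 is below any possible NCC value
    for lag in range(max_lag + 1):
        head = right[lag:lag + window]
        if len(head) < window:
            break  # not enough samples left in `right` for a full window
        score = ncc(tail, head)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


# Illustrative data: two sinusoids with a period of 20 samples, the second shifted by 5.
left = [math.sin(2 * math.pi * n / 20) for n in range(100)]
right = [math.sin(2 * math.pi * (n - 5) / 20) for n in range(100)]
lag, score = best_join(left, right, window=40, max_lag=19)
print(lag, round(score, 3))
```

A high score at the chosen lag indicates that the two segments are similar in shape and in phase at the join point, which is the property the abstract attributes to the measure. Searching for the best unit sequence through the whole sausage would additionally need a lattice search over such pairwise scores, which this sketch does not attempt.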
Similar resources
A hybrid TTS between unit selection and HMM-based TTS under limited data conditions
The intelligibility of HMM-based TTS can reach that of the original speech. However, HMM-based TTS is far from natural. By contrast, unit selection TTS is currently the most natural-sounding TTS. However, its intelligibility and its naturalness in segmental duration and timing are not stable. Additionally, unit selection needs to store a huge amount of data for concatenation. Recently, hybrid a...
Generating natural F0 trajectory with additive trees
In HMM-based TTS, while the segmental quality of synthesized speech is quite acceptable, intonation, especially at the sentence level, tends to be somewhat bland. The maximum likelihood (ML) criterion used in HMM training and parameter trajectory generation is partially responsible for the blandness. Additionally, the F0 trajectory thus generated has a smaller dynamic range than that of natural...
Advances in Spectral Parameterization for Statistical (HMM-Based) TTS
HMM-based parametric speech synthesis has recently become an alternative to the concatenative TTS approach, especially when low footprint and general speech domain are required. A majority of speech parameterization models used in state-of-the-art HMM TTS systems employ source-filter waveform synthesis schemes. Sinusoidal representation and waveform generation of speech is an alternative to the ...
TTS synthesis with bidirectional LSTM based recurrent neural networks
Feed-forward deep neural network (DNN)-based text-to-speech (TTS) systems have recently been shown to outperform decision-tree clustered context-dependent HMM TTS systems [1, 4]. However, the long time span contextual effect in a speech utterance is still not easy to accommodate, due to the intrinsic, feed-forward nature of DNN-based modeling. Also, to synthesize a smooth speech trajectory, th...
Automatic Sentence Selection from Speech Corpora Including Diverse Speech for Improved HMM-TTS Synthesis Quality
Using publicly available audiobooks for HMM-TTS poses new challenges. This paper addresses the issue of diverse speech in audiobooks. The aim is to identify diverse speech likely to have a negative effect on HMM-TTS quality. Manual removal of diverse speech was found to yield better synthesis quality despite halving the training corpus. To handle large amounts of data an automatic approach is p...